In [1]:
# all imports here
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
import re
import math
import calendar
import datetime
import random
from calendar import monthrange
from sklearn import datasets, linear_model, ensemble
from sklearn.model_selection import train_test_split
%matplotlib inline
In [2]:
DATA_FOLDER = './Data' # use the data folder provided in "Tutorial 02 - Intro to Pandas"
The DATA_FOLDER/ebola folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.
Use pandas to import these data files into a single DataFrame.
Using this DataFrame, calculate for each country the daily average per month of new cases and deaths.
Make sure you handle all the different expressions for new cases and deaths that are used in the reports.
In [3]:
country_folder_names = ['guinea_data', 'liberia_data', 'sl_data']
country_paths = [DATA_FOLDER + '/ebola/' + name for name in country_folder_names]
# all files about the ebola task (by country)
all_ebola_files = [glob.glob(os.path.join(path, "*.csv")) for path in country_paths]
In [4]:
guinea_files = all_ebola_files[0]
guinea_df = pd.concat([pd.read_csv(path) for path in guinea_files], axis=0).rename(
    columns={'Date': 'Date', 'Description': 'Info', 'Totals': 'Total'})
guinea_df['Country'] = 'Guinea'
#'Total deaths of confirmed' | 'Total cases of confirmed'
liberia_files = all_ebola_files[1]
liberia_df = pd.concat([pd.read_csv(path) for path in liberia_files], axis=0).rename(
    columns={'Date': 'Date', 'Variable': 'Info', 'National': 'Total'})
liberia_df['Country'] = 'Liberia'
# 'Total death/s in confirmed cases' | 'Total confirmed cases'
sl_files = all_ebola_files[2]
sl_df = pd.concat([pd.read_csv(path) for path in sl_files], axis=0).rename(
    columns={'date': 'Date', 'variable': 'Info', 'National': 'Total'})
sl_df['Country'] = 'Sierra Leone'
#'death_confirmed' | 'cum_confirmed' || totals
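The description strings listed in the comments above were found by inspecting the raw files. A minimal sketch of that check, assuming the three DataFrames built above (illustrative only):
In [ ]:
# illustrative check: list the distinct description strings per country,
# to spot every phrasing used for confirmed cases and deaths
for name, df in [('Guinea', guinea_df), ('Liberia', liberia_df), ('Sierra Leone', sl_df)]:
    print(name, sorted(df['Info'].dropna().unique()))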
We then merge the three DataFrames into one and replace missing values with 0.
In [5]:
ebola_df = pd.concat(
    [
        guinea_df,
        liberia_df,
        sl_df
    ],
    axis=0
)
# replace missing values by 0
ebola_df.fillna('0', inplace=True)
The values in the date column are not all in the same format, so we need to standardize them. Then, we set the index of the DataFrame to a combination of the country and the date.
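As an illustration (with two made-up date strings, not taken from the files), pd.to_datetime normalizes different spellings of the same date to a single Timestamp:
In [ ]:
# made-up examples: two spellings of the same date both parse to 2014-08-04
print(pd.to_datetime('2014-08-04'), pd.to_datetime('8/4/2014'))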
In [6]:
# unify dates
ebola_df['Date'] = pd.to_datetime(ebola_df['Date'])
# build index
ebola_df.index = [ebola_df['Country'], ebola_df['Date']]
ebola_df.index.rename(['Country', 'Date_index'], inplace=True)
In [7]:
# displaying some
ebola_df[:5]
Out[7]:
In [8]:
ebola_df = ebola_df[['Total', 'Info']]
In [9]:
# displaying some rows
ebola_df[:5]
Out[9]:
We will assume from now on that only the cumulative counts of confirmed cases and deaths are used.
We could have chosen the daily counts instead of the cumulative counts. However, they were not always given in all countries, which means that we could not have used a uniform method of computation for every country. Moreover, using cumulative counts is sufficient, easier, and more consistent.
Following our assumption, we keep only the rows matching the descriptions mentioned above.
Furthermore, some entries do not consist only of digits but also contain '%' or ',' characters, which need to be removed.
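To make the computation concrete, here is a toy example (made-up numbers) of how the daily average per month falls out of two cumulative readings at the month's boundaries:
In [ ]:
# toy example with made-up numbers: cumulative confirmed cases
# on the first and last day of a 30-day month
cum_start, cum_end = 120, 270
daily_average = (cum_end - cum_start) / 30
print(daily_average)  # 5.0 new cases per day on average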
In [10]:
deaths_info_to_keep = ['Total deaths of confirmed', 'Total death/s in confirmed cases', 'death_confirmed']
cases_info_to_keep = ['Total cases of confirmed', 'Total confirmed cases', 'cum_confirmed']
ebola_df['Total'] = pd.to_numeric(ebola_df['Total'].replace(',|%', '', regex=True))
ebola_df['Deads'] = np.where(ebola_df['Info'].isin(deaths_info_to_keep), ebola_df['Total'], 0)
ebola_df['Cases'] = np.where(ebola_df['Info'].isin(cases_info_to_keep), ebola_df['Total'], 0)
# displaying some rows of the dataframe
ebola_df.head(20)
Out[10]:
In [11]:
countries = ['Guinea', 'Liberia', 'Sierra Leone']
infos = ['Deads', 'Cases']
# we don't need the "Total" and "Info" columns anymore
ebola_infos_df = ebola_df[infos]
In [12]:
# plotting data by country
for country in countries:
ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum).plot(title=country)
As we can see above, lots of data is missing where y = 0; we will need to ignore those points later. Moreover, some points seem to be incorrect: the cumulative counts occasionally decrease, which is impossible unless one can resuscitate.
To get better estimates and ignore unrealistic values, we will create and train one model to detect outliers and another one to extrapolate. We will use a RANSACRegressor to detect the outliers and an ExtraTreesRegressor to extrapolate the values.
We could have used a LinearRegression instead of the ExtraTreesRegressor; this would have led to a better approximation of the trend and could also compensate for human error introduced during data collection. However, we are more interested in the exact daily averages given by the data: since the outliers are taken care of by the RANSACRegressor, the remaining error is small enough to be ignored.
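For readers unfamiliar with RANSAC, here is a minimal self-contained sketch (on synthetic data, not the Ebola data) of how the inlier mask is obtained:
In [ ]:
# minimal RANSAC sketch on synthetic data (not the Ebola data):
# fit a line robustly, then recover which points were treated as inliers
rng = np.random.RandomState(0)
X_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + rng.normal(scale=2.0, size=50)
y_demo[::10] -= 100  # inject a few gross outliers
ransac = linear_model.RANSACRegressor(random_state=0)
ransac.fit(X_demo, y_demo)
print(ransac.inlier_mask_.sum(), 'points kept as inliers out of', len(y_demo))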
In [13]:
day_offset = datetime.datetime(2014, 1, 1)
# number of days in the given month
def days_in_month(year, month):
    return monthrange(year, month)[1]
# length in days of an interval given as two timedeltas (relative to day_offset)
def days_in_interval(start, end):
    return (end.total_seconds() - start.total_seconds()) / (3600 * 24)
# example : a delta of 35 days would return "February"
def days_delta_to_month(days):
    return calendar.month_name[math.floor(days/30) + 1]
Since data is sometimes missing, we then extract, for each country, the intervals where we have data about the number of cases, and where we have the number of deaths.
Each of those intervals is then split into per-month sub-intervals by the helper below.
In [14]:
def build_interval_by_month(start, end):
    assert(start.year == end.year) # works within the same year only
    assert(start.month < end.month) # can't go backwards or stay within the same month
    interval = []
    # corner case #1 : start.day is not necessarily the first of the month
    interval.append([
        datetime.datetime(start.year, start.month, start.day),
        datetime.datetime(start.year, start.month, days_in_month(start.year, start.month))
    ])
    # full months strictly between start and end
    for month_idx in range(start.month + 1, end.month):
        interval.append([
            datetime.datetime(start.year, month_idx, 1),
            datetime.datetime(start.year, month_idx, days_in_month(start.year, month_idx))
        ])
    # corner case #2 : end.day is not necessarily the last of the month
    interval.append([
        datetime.datetime(end.year, end.month, 1),
        datetime.datetime(end.year, end.month, end.day)
    ])
    return [[date - day_offset for date in dates] for dates in interval]
In [15]:
intervals_of_interest = {}
for country in countries:
intervals_of_interest[country] = {}
for info in infos:
agg_data = ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum)
agg_data_greater_zero = agg_data[agg_data[info]>0]
start = agg_data_greater_zero.index[0]
end = agg_data_greater_zero.index[-1]
intervals_of_interest[country][info] = build_interval_by_month(start, end)
To train our models, we created some new features to complement the only one we have. The primary feature is the date of the record, expressed as seconds elapsed since a previously chosen day offset (the beginning of the year) and divided by 100,000; the other features are simple transformations of it (log, square root, square).
In [16]:
def get_features(dates):
    X = pd.DataFrame(
        data=[date.total_seconds() for date in dates],
        index=range(len(dates)),
        columns=['Date_index']
    )
    X["Date_index"] = X["Date_index"] / 100000.
    X["log"] = X["Date_index"].apply(np.log)
    X["Date1/2"] = X["Date_index"]**(1/2.)
    X["Date^2"] = X["Date_index"]**2
    return X
In [17]:
# note : it also plots the predictions
def get_model(country, info):
    agg_data = ebola_infos_df.loc[country].groupby(['Date_index']).agg(sum)
    agg_data_greater_zero = agg_data[agg_data[info] > 0]
    delta = pd.DataFrame(agg_data_greater_zero[info].index)["Date_index"] - day_offset
    X = get_features(delta.tolist())
    # RANSAC flags the outliers; the ExtraTrees model is then fit on the inliers only
    reg1 = linear_model.RANSACRegressor(random_state=1024)
    reg2 = ensemble.ExtraTreesRegressor(random_state=1411)
    reg1.fit(X, agg_data_greater_zero[info])
    inlier_mask = reg1.inlier_mask_
    outlier_mask = np.logical_not(inlier_mask)
    # test_size=0.0 : we keep every inlier for training, with no held-out test set
    x_train, x_test, y_train, y_test = train_test_split(X[inlier_mask], agg_data_greater_zero[info][inlier_mask], test_size=0.0)
    reg2.fit(x_train, y_train)
    pred_df = pd.DataFrame(reg2.predict(X)).rename(columns={0: "Prediction"})
    pred_df["Real Values (zeroes filtered)"] = agg_data_greater_zero[info].values
    pred_df["Date"] = agg_data_greater_zero.index
    pred_df.plot(x="Date", title=country + ' - ' + info)
    return reg2
In [18]:
def plot_info_per_month(intervals, plot_name):
    intervals_df = pd.DataFrame(
        data=[interval[0] for interval in intervals],
        index=[interval[1] for interval in intervals],
    )
    intervals_df.index.name = "Months"
    intervals_df.plot(kind="bar", title=plot_name, legend=False)
    return intervals_df
In [19]:
pred = []
final_df = pd.DataFrame()
for country in countries:
    for info in infos:
        model = get_model(country, info)
        intervals = []
        for interval in intervals_of_interest[country][info]:
            features = get_features(interval)
            pred.append(model.predict(features))
            if interval[0] == interval[1]:
                # single-day month interval: difference with the end of the previous interval
                intervals.append([pred[-1][1] - pred[-2][1], days_delta_to_month(interval[0].days)])
            else:
                intervals.append([
                    (pred[-1][1] - pred[-1][0]) / days_in_interval(interval[0], interval[1]),
                    days_delta_to_month(interval[0].days)
                ])
        temp_df = plot_info_per_month(intervals, country + " - " + info + " - daily average per month")
        temp_df['Country'] = country
        temp_df['Info'] = info
        final_df = pd.concat([final_df, temp_df], axis=0)
final_df.index = [final_df.index, final_df['Country'], final_df['Info']]
final_df = final_df[[0]].rename(columns={0: 'Count'})
final_df.groupby(['Country', 'Info', 'Months']).head()
Out[19]:
In the DATA_FOLDER/microbiome subdirectory, there are 9 spreadsheets of microbiome data that were acquired from high-throughput RNA sequencing procedures, along with a 10th file that describes the content of each.
Use pandas to import the first 9 spreadsheets into a single DataFrame.
Then, add the metadata information from the 10th spreadsheet as columns in the combined DataFrame.
Make sure that the final DataFrame has a unique index and all the NaN values have been replaced by the tag unknown.
-> We import the metadata into a dataframe
-> We import each data file (MIDx.xls) into its own dataframe (1 file = 1 dataframe)
note: we treat the first column as row indices and later use it to concat the different dataframes
-> For each data file, we add its respective label that we get from the metadata dataframe
note: the task says to add the metadata as columns; we interpreted this as using it for the column names of each data source, which makes the most sense
-> We concat the MID dataframes column-wise into one single dataframe
-> We replace NaN values by "unknown"
reading all filenames in the folder ending with .xls
In [20]:
task2_files = glob.glob(os.path.join(DATA_FOLDER+'/microbiome/', "*.xls"))
task2_files.sort()
task2_files
Out[20]:
Thanks to the sort, we can see that the metadata file is at the end of the list, so we extract it from there.
separating data file paths (0 to 8) and metadata file path (last = 9)
In [21]:
metadata_file = task2_files[-1]
mids_files = task2_files[:-1]
importing metadata file into a dataframe and showing it
In [22]:
metadata_df = pd.read_excel(metadata_file, index_col=None)
metadata_df
Out[22]:
note: we can check that the order of the rows matches the order of the files (MID1 -> MID9).
This makes it easy to associate each file with its corresponding column name using indices:
In [23]:
mids_df = []
for idx, file in enumerate(mids_files):
    mids_df.append(pd.read_excel(file, index_col=0, header=None))
    # name the data column after the (GROUP, SAMPLE) pair of the matching metadata row
    mids_df[idx].columns = [[index] for index in metadata_df.loc[range(len(metadata_df)), ["GROUP", "SAMPLE"]].T[idx]]
mids_df[3][:5]
Out[23]:
(above : showing dataframe samples of file MID4.xls)
concat of the 9 dataframes into one (column-wise, aligning rows by taxon) and NAs are replaced by "unknown"
In [24]:
mids_df_concat = pd.concat(mids_df, axis=1)
mids_df_concat.fillna(value="unknown", inplace=True)
mids_df_concat
Out[24]:
(above : showing final dataframe)
In [25]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')
Out[25]:
For each of the following questions state clearly your assumptions and discuss your findings. We first cast the attributes that can be treated as Categorical and work with a DataFrame with a unique index.
In [26]:
data = pd.read_excel(DATA_FOLDER+'/titanic.xls')
In [27]:
data.head(5)
Out[27]:
In [28]:
data.describe(include='all')
Out[28]:
In [29]:
data['pclass'] = data.pclass.astype('category')
data['sex'] = data.sex.astype('category')
data['embarked'] = data.embarked.astype('category')
In [30]:
sns.countplot(x="pclass", data=data, palette="Greens_d")
Out[30]:
In [31]:
sns.countplot(x="embarked", data=data, palette="Greens_d")
Out[31]:
In [32]:
sns.countplot(x="sex", data=data, palette="Greens_d")
Out[32]:
For the discrete intervals, we decide to divide the ages into 9 intervals: one for each decade from 0 to 90 years old. Thanks to pandas' cut function, we do not have to bother about NAs.
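A quick illustration (with made-up ages) of why the NAs take care of themselves: pd.cut leaves NaN entries unbinned, so they never show up in the count plot.
In [ ]:
# made-up ages: pd.cut propagates NaN, so missing ages fall outside every bucket
ages = pd.Series([5, 25, np.nan, 42])
print(pd.cut(ages, [0, 10, 20, 30, 40, 50], right=False))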
In [33]:
sns.countplot(x=pd.cut(data.age, [0,10,20,30,40,50,60,70,80,90], right=False), palette="Greens_d")
Out[33]:
Let's first plot a pie chart of every value we have for the cabin floors.
In [34]:
data['cabin'].astype(str).str[0].value_counts(sort=False).plot(kind='pie')
Out[34]:
As we can see above, around 3/4 of the data is missing for this attribute. We need to ignore the missing values to get a better view of the pie chart. Note that the cabin floor T might seem to be a typo; we will assume it refers to the tank top that one can find on a Titanic deck plan, and hence keep the value.
In [35]:
data['cabin'].astype(str).str[0].value_counts()[1:].plot(kind='pie')
Out[35]:
In [36]:
df = data.groupby("pclass").survived.agg(["sum", lambda x: len(x) - sum(x)])
df.columns = ["survived", "dead"]
df = df.T
df.columns = ["1st class", "2nd class", "3rd class"]
df.plot(kind="pie", subplots=True, figsize=(18, 6))
Out[36]:
Seaborn allows us to plot such a graph directly. The black bars show the confidence intervals.
In [37]:
sns.barplot(x="sex", y="survived", hue="pclass", data=data);
To split the data into two equally populated age categories, we use the median and group the data on each side of it. Obviously, some records share the median value, so we will not get two perfectly equally populated categories; but given the amount of data we have, we consider this small delta acceptable.
In [38]:
df = data[['pclass', 'sex', 'age', 'survived']][data['age'].isnull() != True]
median_age = df[['age']].median()
df['age_cat'] = np.where(df[['age']] > median_age, '>{}'.format(median_age[0]), '<{}'.format(median_age[0]))
df = df[['pclass', 'sex', 'age_cat', 'survived']]
df.groupby(['age_cat', 'pclass', 'sex']).agg(np.mean)
Out[38]: